Generalized linear models and the quasi-likelihood method extend
the ordinary regression models to accommodate more general
conditional distributions of the response. Nonparametric methods 
need no explicit parametric specification, and the resulting model 
is completely determined by the data themselves. However, nonpara- 
metric estimation schemes generally have a slower convergence rate 
such as the local polynomial smoothing estimation of nonparamet- 
ric generalized linear models studied in Fan, Heckman and Wand [J. 
Amer. Statist. Assoc. 90 (1995) 141-150]. In this work, we propose 
a unified family of parametrically-guided nonparametric estimation 
schemes. This combines the merits of both parametric and nonpara- 
metric approaches and enables us to incorporate prior knowledge. 
Asymptotic results and numerical simulations demonstrate the im- 
provement of our new estimation schemes over the original nonpara- 
metric counterpart. 

1. Introduction. As an extension of the ordinary linear model, the gener- 
alized linear model (GLM) broadens techniques of ordinary linear regression 
to accommodate more general conditional distributions of the response. It 
was first introduced by Nelder and Wedderburn (1972). Its estimation is 
based on the iteratively reweighted least squares (IRLS) algorithm, which
only requires a relationship between conditional mean and variance instead 
of its full conditional distribution. This feature was noticed by Wedderburn 
(1974). In this important further extension, Wedderburn replaced the log- 
likelihood by a quasi-loglikelihood function. This is usually referred to as 
the quasi-likelihood method (QLM). 
In generalized linear models (GLMs) [McCullagh and Nelder (1989)], a
typical parametric assumption is that a transformation of the conditional 
mean, referred to as the link function, belongs to some parametric family 
(say, linear or quadratic in the predictor variables). However, misspecifica- 
tion of the parametric family can lead to a completely wrong picture of the 
underlying conditional mean function. This deficiency of parametric mod- 
eling has long been realized in ordinary regression and applies to GLMs 
as well. It calls for an extension of nonparametric regression techniques 
to the GLMs. Green and Yandell (1985), O'Sullivan, Yandell and Raynor 
(1986), and Cox and O'Sullivan (1990) studied the extension to smoothing
splines. Tibshirani and Hastie (1987) based their generalization on the
"running lines" smoother. Fan, Heckman and Wand (1995) extended the local
polynomial fitting technique; their approach includes Staniswalis (1989) as
a special case.

Local polynomial smoothing is a useful technique to explore unknown 
structure in regression and dates back to Stone (1975, 1977). This area 
blossomed when Fan (1993) provided a deep theoretic understanding and 
discovered its elegant properties including the automatic boundary correc- 
tion. Here we focus on local polynomial techniques although the idea can be 
extended to other nonparametric methods. 

Nonparametric methods need no explicit specification of the form of the 
conditional mean for ordinary regression, and more generally, the link trans- 
formation of the conditional mean in the context of GLMs. However, they 
have in general a slower rate of convergence. In practice, prior knowledge 
or exploratory studies may provide us some prior information about the 
shape of the link transformation of the conditional mean. This information 
is ready to guide us in the nonparametric modeling process. In the literature,
parametrically-guided nonparametric estimation methods were proposed to
improve over their nonparametric counterparts in the context of density
estimation [Hjort and Glad (1995); Naito (2004)] and least squares regression
[Glad (1998); Martins-Filho, Mishra and Ullah (2008)]. The idea is very easy
to explain in the least squares regression case. Assume that the response $Y$,
given a covariate $X$, has a conditional mean $m(x) = E(Y \mid X = x)$. Once a
parametric estimator $m(x, \hat\beta)$ of $m(x)$ is obtained, any nonparametric
method can be applied on $\{Y_i/m(X_i, \hat\beta), i = 1, 2, \ldots, n\}$ and
$\{Y_i - m(X_i, \hat\beta), i = 1, 2, \ldots, n\}$ to estimate
$m(x)/m(x, \hat\beta)$ and $m(x) - m(x, \hat\beta)$, respectively. The
corresponding two final estimators are given by the product of
$m(x, \hat\beta)$ and the nonparametric estimator of $m(x)/m(x, \hat\beta)$,
which serves as a nonparametric correction of the parametric estimator
$m(x, \hat\beta)$, and the sum of $m(x, \hat\beta)$ and the nonparametric
estimator of $m(x) - m(x, \hat\beta)$, respectively. Theoretically these
two parametrically-guided estimators are shown to achieve bias reduction
compared to the original nonparametric estimator when $m(\cdot)$ can be
approximated by the family $\{m(\cdot, \beta)\}$.

Due to its nice property of bias reduction, it is desirable to extend this 
parametrically-guided estimation scheme to GLMs and QLM. However, for a
response with a general distribution other than normal, the regressands
$Y/m(X, \hat\beta)$ and $Y - m(X, \hat\beta)$ do not have statistical
properties that facilitate estimating $m(x)/m(x, \hat\beta)$ and
$m(x) - m(x, \hat\beta)$, so the straightforward extension is not possible.
In this work, we take on this problem and propose a unified family of
parametrically-guided estimation schemes for QLM.
Asymptotic theory and numerical simulations are used to justify our pro- 
posed methods. In the literature, similar approaches have been used to re- 
duce variance. Cheng, Peng and Wu (2007) proposed to form a linear com- 
bination of a preliminary estimator to reduce variance in smoothing, and 
Cheng and Hall (2003) studied variance reduction in nonparametric surface 
estimation. 

The rest of the paper is organized as follows. Section 2 presents a funda- 
mental framework of GLMs and QLM. A unified family of parametrically- 
guided nonparametric estimation schemes is introduced in Section 3. Asymp- 
totic properties are developed to show their improvement over the original 
nonparametric counterpart in Section 4. Section 5 discusses how to select 
one parameter in the unified family. Section 6 gives a general pre-asymptotic 
bandwidth selector based on bias-variance tradeoff. Simulations in Section 
7 and real data analysis in Section 8 show our new schemes' finite sample 
performance in comparison to the original nonparametric method. We con- 
clude with a short discussion in Section 9. Technical proofs are given in the 
Appendix. 

2. GLMs and quasi-likelihood models. Let $(\mathbf{X}_1, Y_1), \ldots, (\mathbf{X}_n, Y_n)$
be a set of i.i.d. random pairs where for each $i$, $Y_i$ is a scalar response
variable, and $\mathbf{X}_i$ denotes its corresponding $d$-dimensional explanatory
covariates having density $f_X$ with support $\mathrm{supp}(f_X) \subseteq \mathbb{R}^d$.
In GLMs, we assume that the response's conditional distribution belongs to a
one-parameter exponential family

(2.1) $f_{Y|X}(y \mid \mathbf{x}) = \exp([\theta(\mathbf{x}) y - b(\theta(\mathbf{x}))]/a(\phi) + c(y, \phi))$,

where $a(\cdot)$, $b(\cdot)$ and $c(\cdot, \cdot)$ are some known functions, $\phi$ is
the dispersion parameter and $\theta$ is the canonical parameter. For (2.1), the
response has conditional mean $\mu(\mathbf{x}) = b'(\theta(\mathbf{x}))$ and conditional
variance $\mathrm{var}(Y \mid \mathbf{X} = \mathbf{x}) = a(\phi) b''(\theta(\mathbf{x}))$.

Parametric GLMs assume that $\eta(\mathbf{x}) = g(\mu(\mathbf{x}))$ and
$\eta(\mathbf{x}) = \beta_0 + \mathbf{x}^T \beta$ for some monotonic link $g(\cdot)$.
When the canonical link $g = (b')^{-1}$ is used, the composition $g \circ b'(\cdot)$
reduces to the identity function and $\theta(\mathbf{x}) = \eta(\mathbf{x})$. In this
case, (2.1) simplifies to
$f_{Y|X}(y \mid \mathbf{x}) = \exp([y \eta(\mathbf{x}) - b(\eta(\mathbf{x}))]/a(\phi) + c(y, \phi))$.
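As a concrete instance of (2.1), the Poisson model used later in Example 7.1 can be written out explicitly. The worked equations below are a standard textbook specialization, not taken from the text above; they assume the usual Poisson choices $a(\phi) = 1$ and $c(y, \phi) = -\log y!$.

```latex
% Poisson response with mean \mu(x), written in the form (2.1)
\begin{align*}
f_{Y|X}(y \mid x) &= \frac{\mu(x)^y e^{-\mu(x)}}{y!}
  = \exp\bigl( y \log \mu(x) - \mu(x) - \log y! \bigr), \\
\theta(x) &= \log \mu(x), \qquad b(\theta) = e^{\theta}, \qquad a(\phi) = 1, \qquad c(y, \phi) = -\log y!, \\
b'(\theta(x)) &= e^{\theta(x)} = \mu(x), \qquad
a(\phi)\, b''(\theta(x)) = e^{\theta(x)} = \mu(x) = \operatorname{var}(Y \mid X = x), \\
g &= (b')^{-1} = \log \quad \text{(canonical link)}, \qquad \eta(x) = \log \mu(x) = \theta(x).
\end{align*}
```

With the canonical log link, the conditional mean is $\exp(\eta(x))$, which is the form in which the Poisson example of Section 7 is stated.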
In common practice, the full likelihood may be unavailable. However,
the relationship between the conditional mean and variance may be readily
available. In this case, estimation of $\mu(\mathbf{x})$ can be achieved by
replacing the conditional log-likelihood $\log f_{Y|X}(y \mid \mathbf{x})$ by a
quasi-log-likelihood function $Q(\mu(\mathbf{x}), y)$. When we assume that
$\mathrm{var}(Y \mid \mathbf{X} = \mathbf{x}) = V(\mu(\mathbf{x}))$ for some known
positive variance function $V(\cdot)$, the corresponding $Q(\mu, y)$ satisfies

(2.2) $\frac{\partial}{\partial \mu} Q(\mu, y) = \frac{y - \mu}{V(\mu)}$.

More explicitly, $Q(\mu, y) = \int_y^{\mu} (y - w)/V(w) \, dw$. For more details on
QLM, see Wedderburn (1974) and Chapter 9 of McCullagh and Nelder (1989).
The quasi-score (2.2) possesses properties similar to those of the usual
log-likelihood score function. Note that the log-likelihood of (2.1) is a special
case of quasi-likelihood function with $V(\cdot) = a(\phi) b'' \circ (b')^{-1}(\cdot)$.
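The relation between the quasi-score (2.2) and the integral form of $Q(\mu, y)$ can be checked numerically. The following is a minimal sketch, assuming a Poisson-type variance function $V(\mu) = \mu$; the function names and the midpoint-rule integration are illustrative choices, not part of the paper.

```python
import math


def quasi_score(mu, y, variance):
    """Quasi-score (2.2): (y - mu) / V(mu)."""
    return (y - mu) / variance(mu)


def quasi_loglik(mu, y, variance, steps=2000):
    """Q(mu, y): integral of (y - w) / V(w) for w from y to mu (midpoint rule)."""
    width = (mu - y) / steps
    total = 0.0
    for k in range(steps):
        w = y + (k + 0.5) * width
        total += (y - w) / variance(w)
    return total * width


def poisson_variance(mu):
    """Assumed variance function V(mu) = mu (Poisson-type)."""
    return mu


def poisson_quasi_loglik(mu, y):
    """Closed form of the same integral when V(w) = w."""
    return y * math.log(mu / y) - (mu - y)


if __name__ == "__main__":
    print(quasi_loglik(3.0, 5.0, poisson_variance))
    print(poisson_quasi_loglik(3.0, 5.0))
```

For $V(\mu) = \mu$ the closed form agrees with the Poisson log-likelihood $y \log \mu - \mu$ up to a term that does not depend on $\mu$, which illustrates why the log-likelihood of (2.1) is a special case of the quasi-likelihood.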

Due to its generality, we will focus on QLM. Fan, Heckman and Wand (1995)
introduced nonparametric QLM by extending the local polynomial tech- 
niques. We will follow their framework and notation. To ease our presenta- 
tion, we focus on the one-dimension case as the extension to the multivariate 
case is straightforward. For the one-dimension case, our data consist of n 
pairs of observations {(Xi,Yi),i = 1,2,..., n}. 

To enhance flexibility, Fan, Heckman and Wand (1995) modeled $\eta(x)$
nonparametrically. For any $x_0$ in its domain, the local polynomial estimator of
$\eta(x_0)$ is given by $\hat\eta(x_0) = \hat\eta(x_0; p, h) = \hat\beta_0$ where
$\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)^T$ maximizes the locally
weighted quasi-likelihood function

(2.3) $Q(\beta) = Q(\beta; h, x_0) = \sum_{i=1}^n Q(g^{-1}(\mathbf{X}_i^T \beta), Y_i) K_h(X_i - x_0)$,

where, with slight abuse of notation, we define
$\mathbf{X}_i = (1, X_i - x_0, \ldots, (X_i - x_0)^p)^T$ and
$\beta = (\beta_0, \ldots, \beta_p)^T$. Whenever there is no confusion, the extra
arguments are dropped and $Q(\beta)$ is used, similarly for some other notation.
Here $p$ is the order of local polynomial fitting and $K_h(\cdot) = K(\cdot/h)/h$ is a
re-scaling of the kernel function $K(\cdot)$ with a smoothing bandwidth $h$.

3. Nonparametric quasi-maximum likelihood with a parametric guide. 

As argued in the introduction, prior knowledge, physical model or exploratory
analysis may give us some useful information that $\eta(x)$ falls approximately
into a parametric family
$\{\eta(x, \alpha) : \alpha = (\alpha_1, \alpha_2, \ldots, \alpha_q)^T \in A \subset \mathbb{R}^q\}$.
In this section, we present a family of estimation schemes by incorporating the
available useful shape information of $\eta(x)$ to guide us while estimating $\eta(x)$.
Within the parametric family $\eta(x, \alpha)$, we find the optimal fit by maximizing

(3.1) $\sum_{i=1}^n Q(g^{-1}(\eta(X_i, \alpha)), Y_i)$

with respect to $\alpha \in A$. Denote the best fit by $\eta(x, \hat\alpha)$ where
$\hat\alpha$ is the maximizer of (3.1).

Fig. 1. Plots of true $\eta(\cdot)$, estimated guide $\eta(\cdot, \hat\alpha)$, difference
$\eta(\cdot) - \eta(\cdot, \hat\alpha)$ and ratio $\eta(\cdot)/\eta(\cdot, \hat\alpha)$ for
one random sample in Example 7.1.

3.1. Bias reduction. In the local polynomial fitting framework, the bias
is due to the approximation error of the Taylor expansion. The smaller the
approximation error, the less bias in the local polynomial estimator. Recall that
we identify some parametric family $\{\eta(x, \alpha) : \alpha \in A\}$ based on
exploratory studies or prior knowledge and find the best fit $\eta(x, \hat\alpha)$
within this family. As a result, $\eta(x, \hat\alpha)$ should capture the major shape
of $\eta(x)$ and consequently $\eta(x)/\eta(x, \hat\alpha)$ and
$\eta(x) - \eta(x, \hat\alpha)$ have less variation (are smoother) than the original
$\eta(x)$ does. Consequently, they are easier to approximate, and the approximation
errors in their corresponding Taylor expansions are smaller than those of the
original function $\eta(x)$. For example, the
true rj(-) is given by r)o(x) = 3sin(jX — |) + 6 for x £ [—2, 2] [as shown by 
the solid line in panel (A) of Figure 1] in our Poisson simulation Example 7.1.
The nonparametric estimate $\hat\eta(\cdot)$ is given by the dotted line in Figure
2 and indicates a parabolic shape. Hence we identify a parametric family,
$\{\eta(x, \alpha) = \alpha_1 + \alpha_2 x + \alpha_3 x^2 : \alpha = (\alpha_1, \alpha_2, \alpha_3)^T \in \mathbb{R}^3\}$,
within which the best fit is given by the dotted line in panel (A) of Figure 1.
The difference $\eta(x) - \eta(x, \hat\alpha)$ and ratio $\eta(x)/\eta(x, \hat\alpha)$
are shown in panels (B) and (C) of Figure 1, respectively. We can see that the
difference and ratio functions are much flatter than the original function
$\eta(\cdot)$ as desired.

Based on the above argument, two different estimation schemes corre- 
sponding to multiplicative and additive corrections are introduced in Sec- 
tions 3.2 and 3.3, respectively. They are special cases of a unified family of 
corrections presented in Section 3.4. 
3.2. Multiplicative correction. Consider the multiplicative identity

$\eta(x) = \eta(x, \hat\alpha) r_m(x)$,

where $r_m(x) = \eta(x)/\eta(x, \hat\alpha)$. When $\eta(x, \hat\alpha)$ is a good
estimate, the ratio $r_m(x)$ becomes almost flat and allows the choice of a larger
bandwidth. For any $x_0$, we may estimate $r_m(x_0)$ by maximizing the local
quasi-likelihood

$\sum_{i=1}^n Q(g^{-1}(\eta(X_i, \hat\alpha) \mathbf{X}_i^T \beta), Y_i) K_h(X_i - x_0)$

with respect to $\beta$ and set $\hat r(x_0) = \hat\beta_0$, the first component of the
maximizer. Then $\eta(x_0)$ can be estimated by $\eta(x_0, \hat\alpha) \hat r(x_0)$.
This two-step formulation is equivalent to the following one-step estimation.

Locally approximating $r_m(\cdot)$ by a polynomial function and re-scaling it
by a factor $\eta(x_0, \hat\alpha)$, we have the local quasi-likelihood

(3.2) $Q_m(\beta) = Q_m(\beta; h, x_0, \hat\alpha) = \sum_{i=1}^n Q(g^{-1}(\mathbf{X}_i^T \beta \, \eta(X_i, \hat\alpha)/\eta(x_0, \hat\alpha)), Y_i) K_h(X_i - x_0)$.

We maximize (3.2) with respect to $\beta$, and set the final estimator
$\hat\eta_m(x_0) = \hat\eta_m(x_0; p, h, \hat\alpha) = \hat\beta_0$. In this
formulation, the Taylor expansion is supposed to approximate
$\eta(X_i) \eta(x_0, \hat\alpha)/\eta(X_i, \hat\alpha)$ locally at $x = x_0$. This
immediately justifies setting $\hat\beta_0$ as our estimator.




Fig. 2. Plots of true $\eta_0(\cdot)$, nonparametric estimate $\hat\eta(\cdot)$ and two
parametrically-guided estimates $\hat\eta_a(\cdot)$ and $\hat\eta_m(\cdot)$ for one
random sample in Example 7.1 are shown on the left panel. A zoom-in view of the
squared region is given on the right panel.
3.3. Additive correction. The additive identity

$\eta(x) = \eta(x, \hat\alpha) + r_a(x)$

with $r_a(x) = \eta(x) - \eta(x, \hat\alpha)$ leads to another parametrically-guided
nonparametric estimator $\hat\eta_a(x_0) = \hat\eta_a(x_0; p, h, \hat\alpha) = \hat\beta_0$
where $\hat\beta_0$ is the first component of the maximizer of

(3.3) $Q_a(\beta) = Q_a(\beta; h, x_0, \hat\alpha) = \sum_{i=1}^n Q(g^{-1}(\eta(X_i, \hat\alpha) - \eta(x_0, \hat\alpha) + \mathbf{X}_i^T \beta), Y_i) K_h(X_i - x_0)$

with respect to $\beta$. Similarly, the expansion $\mathbf{X}_i^T \beta$ in this
formulation targets approximating
$\eta(X_i) - \eta(X_i, \hat\alpha) + \eta(x_0, \hat\alpha)$ locally at $x = x_0$. Hence
$\hat\beta_0$ estimates $\eta(x_0)$.

3.4. A unified family of corrections. As in Martins-Filho, Mishra and Ullah
(2008), we consider a more general identity
$\eta(x) = \eta(x, \hat\alpha) + r_u(x) \eta(x, \hat\alpha)^{\gamma}$
with $r_u(x) = (\eta(x) - \eta(x, \hat\alpha))/\eta(x, \hat\alpha)^{\gamma}$ for some
$\gamma \ge 0$. We can estimate $\eta(x_0)$ by
$\hat\eta_u(x_0) = \eta(x_0, \hat\alpha) + \hat r_u(x_0) \eta(x_0, \hat\alpha)^{\gamma}$.
Here $\hat r_u(x_0)$ is given by the first component $\hat\beta_0$ of the maximizer of

$\sum_{i=1}^n Q(g^{-1}(\eta(X_i, \hat\alpha) + \mathbf{X}_i^T \beta \, \eta(X_i, \hat\alpha)^{\gamma}), Y_i) K_h(X_i - x_0)$.

As in Section 3.2, an equivalent one-step estimation is available. Let
$\hat\beta_0$ be the first component of the maximizer of

(3.4) $Q_u(\beta) = Q_u(\beta; h, x_0, \hat\alpha) = \sum_{i=1}^n Q(g^{-1}(\eta(X_i, \hat\alpha) + (\mathbf{X}_i^T \beta - \eta(x_0, \hat\alpha)) \eta(X_i, \hat\alpha)^{\gamma}/\eta(x_0, \hat\alpha)^{\gamma}), Y_i) K_h(X_i - x_0)$

with respect to $\beta$. Then $\hat\beta_0$ directly estimates $\eta(x_0)$ and is the
same as $\hat\eta_u(x_0)$. We prefer (3.4) since it facilitates our theoretical
development.

Note that this unified estimator includes the additive and multiplicative
corrections as special cases by setting $\gamma = 0$ and $1$, respectively.
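The one-step criterion (3.4) can be illustrated with a small numerical sketch. The code below is only one possible implementation under stated assumptions: local linear fitting ($p = 1$), the canonical log link with $V(\mu) = \mu$, an Epanechnikov kernel, a damped Newton iteration, and a fixed (not estimated) guide. The test function `eta0`, the guide `quadratic_guide` and the noise-free responses are hypothetical choices made so that only the approximation bias is visible; none of them come from the paper.

```python
import math


def epanechnikov(u):
    """Epanechnikov kernel supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def guided_local_linear(xs, ys, x0, h, guide, gamma, max_iter=100):
    """Maximize (3.4) with p = 1, log link and V(mu) = mu; return beta_0."""
    g0 = guide(x0)
    rows = []  # (X_i - x0, Y_i, K_h(X_i - x0), guide(X_i), guide ratio ** gamma)
    for x, y in zip(xs, ys):
        k = epanechnikov((x - x0) / h) / h
        if k > 0.0:
            g = guide(x)
            rows.append((x - x0, y, k, g, (g / g0) ** gamma))

    def eta(b0, b1, d, g, w):
        # Linear predictor inside (3.4) for a local linear fit.
        return g + (b0 + b1 * d - g0) * w

    def objective(b0, b1):
        # Poisson quasi-log-likelihood up to terms free of beta.
        total = 0.0
        for d, y, k, g, w in rows:
            e = eta(b0, b1, d, g, w)
            total += k * (y * e - math.exp(e))
        return total

    b0, b1 = g0, 0.0  # start from the guide itself
    for _ in range(max_iter):
        s0 = s1 = a00 = a01 = a11 = 0.0
        for d, y, k, g, w in rows:
            mu = math.exp(eta(b0, b1, d, g, w))
            score = k * (y - mu) * w   # gradient contribution
            info = k * mu * w * w      # negative Hessian contribution
            s0 += score
            s1 += score * d
            a00 += info
            a01 += info * d
            a11 += info * d * d
        det = a00 * a11 - a01 * a01
        step0 = (a11 * s0 - a01 * s1) / det
        step1 = (a00 * s1 - a01 * s0) / det
        # Step-halving keeps the Newton update from decreasing the objective.
        t, base = 1.0, objective(b0, b1)
        while t > 1e-8 and objective(b0 + t * step0, b1 + t * step1) < base - 1e-12:
            t *= 0.5
        b0, b1 = b0 + t * step0, b1 + t * step1
        if abs(step0) + abs(step1) < 1e-10:
            break
    return b0


def eta0(x):
    """Hypothetical true function used only for this illustration."""
    return 2.0 + math.cos(x)


def quadratic_guide(x):
    """Hypothetical fixed guide: a rough quadratic approximation of eta0."""
    return 3.0 - 0.5 * x * x


def constant_guide(x):
    """A constant guide turns (3.4) back into the unguided criterion (2.3)."""
    return 1.0


xs = [-2.0 + 0.02 * i for i in range(201)]
ys = [math.exp(eta0(x)) for x in xs]  # noise-free responses: only bias remains

if __name__ == "__main__":
    for name, guide, gamma in [("unguided", constant_guide, 0.0),
                               ("additive", quadratic_guide, 0.0),
                               ("multiplicative", quadratic_guide, 1.0)]:
        estimate = guided_local_linear(xs, ys, 0.5, 0.5, guide, gamma)
        print(name, estimate - eta0(0.5))
```

Setting `gamma` to 0 or 1 gives the additive and multiplicative corrections, and a constant guide gives the original local quasi-likelihood fit, so the three estimators can be compared at the same bandwidth with a single routine.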

4. Asymptotic properties. We assume that our data
$\{(X_i, Y_i), i = 1, 2, \ldots, n\}$ are generated from the quasi-likelihood model
with unknown true $\eta_0(x)$. Asymptotic properties of our final estimates are
achieved in two steps: establish asymptotic properties with a fixed parametric
guide in Section 4.1 and show that the same asymptotic properties apply to the
case with an estimated parametric guide in Section 4.2.

In the univariate case, we denote the marginal density of $X$ by $f_X$. The
asymptotic properties of the local polynomial estimator differ depending on
whether $x_0$ lies in the interior of $\mathrm{supp}(f_X)$ or near the boundary.
Suppose that $K$ is supported on $[-1, 1]$. Then the support of $K_h(x_0 - \cdot)$
is $\mathcal{E}_{x_0,h} = \{z : |z - x_0| \le h\}$. We will call $x_0$ an interior point
of $\mathrm{supp}(f_X)$ if $\mathcal{E}_{x_0,h} \subseteq \mathrm{supp}(f_X)$ and a
boundary point otherwise. If $\mathrm{supp}(f_X) = [a, b]$, then $x_0$ is a boundary
point if and only if $x_0 = a + \alpha h$ or $x_0 = b - \alpha h$ for some
$0 \le \alpha < 1$. Denote
$\mathcal{D}_{x_0,h} = \{z : x_0 - hz \in \mathrm{supp}(f_X)\} \cap [-1, 1]$. For
any measurable set $\mathcal{A} \subseteq \mathbb{R}$, define
$\nu_l(\mathcal{A}) = \int_{\mathcal{A}} z^l K(z) \, dz$. Let $N_p(\mathcal{A})$ be the
$(p + 1) \times (p + 1)$ matrix having $(i, j)$ entry equal to
$\nu_{i+j-2}(\mathcal{A})$, and let $M_{r,p}(z; \mathcal{A})$ be the same as
$N_p(\mathcal{A})$, but with the $(r + 1)$th column replaced by
$(1, z, \ldots, z^p)^T$. Then for $|N_p(\mathcal{A})| \neq 0$, define

(4.1) $K_{r,p}(z; \mathcal{A}) = r! \{|M_{r,p}(z; \mathcal{A})|/|N_p(\mathcal{A})|\} K(z)$.

When $[-1, 1] \subseteq \mathcal{A}$, we will suppress $\mathcal{A}$ and simply write
$\nu_l$, $N_p$, $M_{r,p}$, and $K_{r,p}$. It can be shown that
$(-1)^r K_{r,p}(\cdot; \mathcal{A})$ is an order $(r, s)$ kernel as defined
by Gasser, Müller and Mammitzsch (1985) where $s = p + 1$ if $p - r$ is odd,
and $s = p + 2$ if $p - r$ is even. It is an equivalent kernel induced by the
local polynomial fitting [Fan and Gijbels (1995)]. This family of kernels is
useful for giving concise expressions for the asymptotic distribution of the local
polynomial estimator for $x_0$ lying either in the interior of $\mathrm{supp}(f_X)$
or near its boundaries. Denote
$\rho(x_0) = \{g'(\mu(x_0))^2 V(\mu(x_0))\}^{-1}$. Note that when the
model belongs to a one-parameter exponential family and the canonical link
is used then $g'(\mu(x_0)) = 1/\mathrm{var}(Y \mid X = x_0)$, and
$\rho(x_0) = \mathrm{var}(Y \mid X = x_0)$, if the variance function $V(\cdot)$ is
correctly specified. The asymptotic variance of our local polynomial estimator
depends on

$\sigma^2_{r,s,p}(x_0; K, \mathcal{A}) = \mathrm{var}(Y \mid X = x_0) \, g'(\mu(x_0))^2 f_X(x_0)^{-1} \int_{\mathcal{A}} K_{r,p}(z; \mathcal{A}) K_{s,p}(z; \mathcal{A}) \, dz$.


Since the multiplicative and the additive corrections are both special cases of 
the unified family of corrections, we only consider the asymptotic properties 
for the unified family of corrections. 
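The equivalent kernel (4.1) becomes concrete in the local linear case. The sketch below is an illustration under stated assumptions: it covers only $r = 0$ and $p = 1$, takes the set $\mathcal{A}$ to be an interval, uses an Epanechnikov kernel, and computes the moments $\nu_l(\mathcal{A})$ by a midpoint rule. The helper names are hypothetical.

```python
def epanechnikov(z):
    """Epanechnikov kernel supported on [-1, 1]."""
    return 0.75 * (1.0 - z * z) if abs(z) <= 1.0 else 0.0


def moment(l, lo, hi, kernel, steps=2000):
    """nu_l(A): integral over A = [lo, hi] of z**l * kernel(z) (midpoint rule)."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        z = lo + (k + 0.5) * width
        total += (z ** l) * kernel(z)
    return total * width


def boundary_set(x0, h, a, b):
    """D_{x0,h} = {z : x0 - h z in [a, b]} intersected with [-1, 1]."""
    return max(-1.0, (x0 - b) / h), min(1.0, (x0 - a) / h)


def equivalent_kernel_local_linear(lo, hi, kernel):
    """Return z -> K_{0,1}(z; A) from (4.1) for A = [lo, hi]."""
    nu0 = moment(0, lo, hi, kernel)
    nu1 = moment(1, lo, hi, kernel)
    nu2 = moment(2, lo, hi, kernel)
    det_n = nu0 * nu2 - nu1 * nu1  # |N_1(A)|

    def k_star(z):
        # |M_{0,1}(z; A)|: first column of N_1(A) replaced by (1, z)^T.
        det_m = nu2 - nu1 * z
        return det_m / det_n * kernel(z)

    return k_star


if __name__ == "__main__":
    # Interior point: the equivalent kernel coincides with K itself.
    lo, hi = boundary_set(0.0, 0.5, -2.0, 2.0)
    print(equivalent_kernel_local_linear(lo, hi, epanechnikov)(0.3))
    # Boundary point x0 = a + 0.3 h: the kernel is reweighted on [-1, 0.3].
    lo, hi = boundary_set(-1.85, 0.5, -2.0, 2.0)
    print(equivalent_kernel_local_linear(lo, hi, epanechnikov)(0.3))
```

At a boundary point the reweighted kernel still integrates to one and has a vanishing first moment over $\mathcal{D}_{x_0,h}$, which is the automatic boundary correction of local linear fitting mentioned in the introduction.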

4.1. Asymptotic properties with a fixed guide. Recall that our
parametrically-guided nonparametric estimators are achieved by maximizing
$Q_u(\beta; h, x_0, \hat\alpha)$ defined by (3.4). Note that the definition of
$Q_u(\beta; h, x_0, \hat\alpha)$ involves $\hat\alpha$, which corresponds to the best
fit within the parametric family $\{\eta(x, \alpha), \alpha \in A\}$
and depends on our data $\{(X_i, Y_i), i = 1, 2, \ldots, n\}$. This dependency
consequently makes it intractable to directly study the asymptotic properties of
the maximizer of $Q_u(\beta; h, x_0, \hat\alpha)$. To avoid the complication caused
by the use of the estimated $\hat\alpha$, we first consider the case with a fixed
guide $\eta(x, \alpha)$.

When a fixed guide $\eta(\cdot, \alpha)$ is used, the asymptotic normality of the
corresponding estimator $\hat\eta_u(x_0; p, h, \alpha)$ is given by Theorem 1.

Theorem 1. Let p > 0, 7 > 0, and assume that h = h n — > 0, nh 2p+1 — > 
00, and nh p+ < 00 as n — > 00. Under conditions (Al)-(A5) stated in 
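the Appendix, if $x_0$ is a fixed point in the interior of $\mathrm{supp}(f_X)$
satisfying $\eta(x_0, \alpha) \neq 0$, then we have

(4.2) $\frac{\sqrt{nh}}{\sigma_{0,0,p}(x_0; K)} [\hat\eta_u(x_0; p, h, \alpha) - \eta_0(x_0) - \mathrm{Bias}] \xrightarrow{d} N(0, 1)$,

where the bias term is given by $\mathrm{Bias}_o$ for odd $p$ and $\mathrm{Bias}_e$ for
even $p$ defined by

$\mathrm{Bias}_o = \frac{\eta(x_0, \alpha)^{\gamma}}{(p+1)!} \left( \frac{\eta_0(\cdot) - \eta(\cdot, \alpha)}{\eta(\cdot, \alpha)^{\gamma}} \right)^{(p+1)}(x_0) \left\{ \int z^{p+1} K_{0,p}(z) \, dz \right\} h^{p+1} \{1 + O(h)\}$

and

$\mathrm{Bias}_e = \left\{ \int z^{p+2} K_{0,p}(z) \, dz \right\} \eta(x_0, \alpha)^{\gamma} \left\{ \frac{1}{(p+2)!} \left( \frac{\eta_0(\cdot) - \eta(\cdot, \alpha)}{\eta(\cdot, \alpha)^{\gamma}} \right)^{(p+2)}(x_0) + \frac{1}{(p+1)!} \left( \frac{\eta_0(\cdot) - \eta(\cdot, \alpha)}{\eta(\cdot, \alpha)^{\gamma}} \right)^{(p+1)}(x_0) \frac{\frac{d}{dz}(\rho \, \eta(\cdot, \alpha)^{2\gamma} f_X)(x_0)}{(\rho \, \eta(\cdot, \alpha)^{2\gamma} f_X)(x_0)} \right\} h^{p+2} \{1 + O(h)\}$.

If $x_0 = x_n$ is of the form $x_0 = x_\partial + ch$ satisfying
$\eta(x_0, \alpha) \neq 0$ where $x_\partial$ is a point on the boundary of
$\mathrm{supp}(f_X)$ and $c \in [-1, 1]$, then (4.2) holds with
$\sigma_{0,0,p}(x_0; K)$ and $\int z^{p+1} K_{0,p}(z) \, dz$ replaced by
$\sigma_{0,0,p}(x_0; K, \mathcal{D}_{x_0,h})$ and
$\int_{\mathcal{D}_{x_0,h}} z^{p+1} K_{0,p}(z; \mathcal{D}_{x_0,h}) \, dz$.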

Remark 1. Note that we use $\eta(x_0, \alpha)$ in the denominator, which poses
difficulty handling any zero point of $\eta(\cdot, \alpha)$, that is, $x_0$ satisfying
$\eta(x_0, \alpha) = 0$. These zero points are ruled out in Theorem 1. A similar
observation was made by Hjort and Glad (1995). However, this difficulty does not
occur in our limited numerical experiments.

Remark 2. To simplify our presentation, we only state the asymptotic
normality of the estimator for the function $\eta_0(\cdot)$. However, our method can
also estimate its high order derivatives, for which the asymptotic properties
can be found in Proposition 1 in the Appendix.

4.2. Asymptotic properties with an estimated guide. Note that our proposed
estimation schemes use the best parametric fit $\eta(x, \hat\alpha)$ estimated based
on our data instead of a fixed guide $\eta(x, \alpha)$. Compared to the simpler case
with a fixed guide, the variability of parameter estimation now influences
the asymptotic result. However, we shall show below that asymptotically
there is no precision loss caused by the additional estimation step.

Clearly, the parametric family used in the first step of our estimation
schemes is most likely an incorrect specification. Consequently, the first-stage
parametric estimator is a maximum quasi-likelihood estimator with a
misspecified model. As in Hurvich and Tsai (1995), we denote the proposed
parametric joint density of $(X_i, Y_i)$ and the corresponding actual unknown
joint density by $f(x, y; \alpha) = f_X(x) \exp(Q(g^{-1}(\eta(x, \alpha)), y))$ and
$f_0(x, y) = f_X(x) \exp(Q(g^{-1}(\eta_0(x)), y))$, respectively, where $f_X(\cdot)$ is
the marginal density of $X$. Denote by $\alpha_0$ the pseudo parameter value that
minimizes the Kullback-Leibler distance between $f(x, y; \alpha)$ and $f_0(x, y)$,
that is,

$\alpha_0 = \arg\min_{\alpha \in A} E_0 \left[ \log \frac{f_0(X, Y)}{f(X, Y; \alpha)} \right]$,

where the expectation $E_0$ is taken with respect to the unknown true density.

To proceed, we make regularity assumptions (B1)-(B5) given in the Appendix
to assure that the pseudo-maximum quasi-likelihood estimator $\hat\alpha$ is
$\sqrt{n}$-consistent for $\alpha_0$, that is,
$\sqrt{n}(\hat\alpha - \alpha_0) = O_p(1)$ [see White (1982)].

Theorem 2. Under additional conditions (B1)-(B5), the asymptotic
results, with $\alpha$ replaced by $\alpha_0$, of Theorem 1 continue to hold when an
estimated fit $\eta(x, \hat\alpha)$ is used.

Remark 3. Note that our theoretical results include those of the original
nonparametric method in Fan, Heckman and Wand (1995) as a special
case by setting a constant guide, say $\eta(\cdot, \alpha) = 1$. For our
parametrically-guided estimation, the asymptotic bias is determined by both the
guide $\eta(\cdot, \alpha)$ and $\gamma$. When the same smoothing bandwidth is used, our
theoretical results allow straightforward comparison between the original
nonparametric method and our new parametrically-guided estimation schemes with
different $\gamma$. Although the asymptotic variance remains the same, the advantage
of using a parametric guide is that it can reduce the asymptotic bias. For
example, when $p = 1$ and $h$ is the same, our parametrically-guided method
asymptotically reduces integrated squared bias provided that

(4.3) $\min_{\gamma \ge 0} \int_{\mathrm{supp}(f_X)} \left( \eta(x, \alpha_0)^{\gamma} \left( \frac{\eta_0(\cdot) - \eta(\cdot, \alpha_0)}{\eta(\cdot, \alpha_0)^{\gamma}} \right)^{(2)}(x) \right)^2 dx < \int_{\mathrm{supp}(f_X)} (\eta_0^{(2)}(x))^2 \, dx$.


However, this is only one part of the whole story because when the guide 
is appropriately selected, our parametrically-guided estimation schemes will 
select larger smoothing bandwidths (as the correction function is smoother) 
and, consequently, improve performance by reducing variance as well. 

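5. Selection of $\gamma$. Equation (4.3) can be used as a general rule of thumb
for identifying an appropriate parametric guide and selecting $\gamma$ by minimizing

$\int_{\mathrm{supp}(f_X)} \left( \eta(x, \alpha_0)^{\gamma} \left( \frac{\eta_0(\cdot) - \eta(\cdot, \alpha_0)}{\eta(\cdot, \alpha_0)^{\gamma}} \right)^{(2)}(x) \right)^2 dx$,

namely, the quantity on the left-hand side.

In finite-sample applications, we obtain the best fit $\eta(x, \hat\alpha)$ for each
potential parametric guide family $\eta(x, \alpha)$ and use local polynomial smoothing
to estimate the second order derivative function
$\left( \frac{\eta_0(\cdot) - \eta(\cdot, \hat\alpha)}{\eta(\cdot, \hat\alpha)^{\gamma}} \right)^{(2)}(x)$.
Then we define

(5.1) $\hat\theta_{\gamma} = \int_{\mathrm{supp}(f_X)} \left( \eta(x, \hat\alpha)^{\gamma} \, \widehat{\left( \frac{\eta_0(\cdot) - \eta(\cdot, \hat\alpha)}{\eta(\cdot, \hat\alpha)^{\gamma}} \right)^{(2)}}(x) \right)^2 dx$.

Treating $\hat\theta_{\gamma}$ as a function of the parametric family and $\gamma$, we can
find the best parametric guide and its corresponding best $\gamma$ by minimizing
$\hat\theta_{\gamma}$.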
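The plug-in criterion (5.1) can be sketched numerically as follows. This is an illustration under assumptions that differ from the procedure described above: the second derivative is obtained by central finite differences applied to a smooth pilot function rather than by local polynomial smoothing, the integral is a trapezoid rule over a user-supplied interval, and the pilot function and guide are hypothetical. The grid of $\gamma$ values mirrors the one used in Section 7.

```python
import math


def second_derivative(f, x, delta=1e-3):
    """Central finite-difference approximation to f''(x)."""
    return (f(x + delta) - 2.0 * f(x) + f(x - delta)) / (delta * delta)


def theta(eta_pilot, guide, gamma, lo, hi, n_grid=401):
    """Criterion (5.1) on [lo, hi] by the trapezoid rule."""
    def correction(x):
        # (eta - guide) / guide**gamma, the function whose curvature drives the bias.
        return (eta_pilot(x) - guide(x)) / guide(x) ** gamma

    step = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        x = lo + i * step
        value = (guide(x) ** gamma * second_derivative(correction, x)) ** 2
        total += value * (0.5 if i in (0, n_grid - 1) else 1.0)
    return total * step


# Grid 0, 0.1, ..., 1, 1.2, 1.4, ..., 5 as used in the Monte Carlo study.
GAMMA_GRID = [k / 10 for k in range(0, 11)] + [k / 10 for k in range(12, 51, 2)]


def select_gamma(eta_pilot, guide, lo, hi, grid=GAMMA_GRID):
    """Return the grid value of gamma that minimizes the criterion."""
    return min(grid, key=lambda gamma: theta(eta_pilot, guide, gamma, lo, hi))


def eta_pilot(x):
    """Hypothetical smooth pilot estimate of eta_0."""
    return 2.0 + math.cos(x)


def quadratic_guide(x):
    """Hypothetical fitted guide; it must stay positive for fractional gamma."""
    return 3.0 - 0.5 * x * x


if __name__ == "__main__":
    print(theta(eta_pilot, lambda x: 1.0, 0.0, -1.0, 1.0))    # unguided benchmark
    print(theta(eta_pilot, quadratic_guide, 0.0, -1.0, 1.0))  # additive correction
    print(select_gamma(eta_pilot, quadratic_guide, -1.0, 1.0))
```

With a constant guide the criterion reduces to the integrated squared second derivative on the right-hand side of (4.3), so the same routine gives both sides of that comparison.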

For the case of least squares regression, Huang and Fan (1999) studied 
convergence rate of nonparametric estimators of quadratic regression
functionals such as the quantities on both sides of (4.3). We can apply their
Theorems 4.1-4.4 to get the convergence of our plug-in estimator
$\hat\theta_{\gamma}$ by noting that $\hat\alpha$ converges to $\alpha_0$ with a faster
speed. However, the corresponding theory for the more general GLM and
quasi-likelihood method is not available.
A serious treatment for this kind of problem is very technical. It requires a 
full paper to address the issues and is beyond our current scope. 

In simulation examples of Section 7, for each example we generate 10 ad- 
ditional samples. Based on these 10 samples, we use the Extended Residual 
Squares Criterion (ERSC) [see equation (5.6) of Fan, Farmen and Gijbels 
(1998)] to select the smoothing bandwidth for estimating the second order 

From our simulations, the improvement from the additive or multiplicative
correction to the best $\gamma$ is much smaller than the improvement from the
original method to the additive or multiplicative correction. In other words,
the performance improvement is not very sensitive to $\gamma$. This
is due in part to the choice of the parametric guides, which usually capture
the main shape. Thus, in application, we suggest that a simple and effective
method is to try a few discrete values of $\gamma$ [including additive ($\gamma = 0$)
and multiplicative ($\gamma = 1$) guides as specific examples] and to pick the value
of $\gamma$ by cross-validation. This will result in an improved performance over
the vanilla nonparametric approach, if that approach is also included in the
cross-validation comparison.

6. Pre-asymptotic bandwidth selection. While optimizing (2.3) and (3.4), 
we need to tune the corresponding smoothing bandwidths. In this work, we 
will use the pre-asymptotic bandwidth selection method introduced in Fan 
and Gijbels (1995) and Fan, Farmen and Gijbels (1998) which is based on 
the bias-variance tradeoff. 

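6.1. Estimating bias and variance. Without loss of generality, we use
(3.4) to demonstrate the idea. It will include (2.3) as a special case by
using a constant guide. In the remainder of this section, we denote
$\hat\beta = \hat\beta(x_0, \hat\alpha) = \arg\max_{\beta} Q_u(\beta; h, x_0, \hat\alpha)$.
The bias of the estimate $\hat\beta$ comes from the approximation error in the
Taylor expansion. Denote the approximation error at $X_i$ by

$r(X_i) = \eta_0(X_i) - \eta(X_i, \hat\alpha) - (\mathbf{X}_i^T \beta - \eta(x_0, \hat\alpha)) \left( \frac{\eta(X_i, \hat\alpha)}{\eta(x_0, \hat\alpha)} \right)^{\gamma}$.

Suppose that the $(p + a + 1)$th derivatives of functions $\eta_0(\cdot)$ and
$\eta(\cdot, \hat\alpha)$ exist at $x_0$ for some integer $a > 0$. Further expansions of
$\eta_0(X_i)$ and $\eta(X_i, \hat\alpha)$ give

$r(X_i) \approx \left( \frac{\eta(X_i, \hat\alpha)}{\eta(x_0, \hat\alpha)} \right)^{\gamma} \sum_{j=1}^{a} \beta_{p+j} (X_i - x_0)^{p+j} \equiv r_i$.

Here the choice of $a$, the approximation order, will affect the performance
of the estimated bias. Practically, it can be chosen as $a = 1$ or $2$.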
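Now pretend that the approximated approximation errors $r_i$ are known.
A more accurate local quasi-log-likelihood is

$Q_u^*(\beta) = Q_u^*(\beta; h, x_0, \hat\alpha) = \sum_{i=1}^n Q(g^{-1}(\eta_i(\beta) + r_i), Y_i) K_h(X_i - x_0)$,

where $\eta_i(\beta) = \eta(X_i, \hat\alpha) + (\mathbf{X}_i^T \beta - \eta(x_0, \hat\alpha)) \eta(X_i, \hat\alpha)^{\gamma}/\eta(x_0, \hat\alpha)^{\gamma}$.
The maximizer of the local quasi-log-likelihood is denoted by
$\hat\beta^* = \hat\beta^*(x_0, \hat\alpha)$. Define

$Q_u^{*\prime}(\beta) = \frac{\partial}{\partial \beta} Q_u^*(\beta) = \sum_{i=1}^n \frac{Y_i - g^{-1}(\eta_i(\beta) + r_i)}{V(g^{-1}(\eta_i(\beta) + r_i))} (g^{-1})'(\eta_i(\beta) + r_i) \left( \frac{\eta(X_i, \hat\alpha)}{\eta(x_0, \hat\alpha)} \right)^{\gamma} \mathbf{X}_i K_h(X_i - x_0)$

and similarly $Q_u^{*\prime\prime}(\beta) = \frac{\partial^2}{\partial \beta \, \partial \beta^T} Q_u^*(\beta)$
to denote the gradient vector and Hessian matrix of the local quasi-likelihood
$Q_u^*$, respectively. Applying Taylor's expansion to $Q_u^{*\prime}(\hat\beta^*)$
around $\hat\beta(x_0, \hat\alpha)$, we get

$0 = Q_u^{*\prime}(\hat\beta^*) \approx Q_u^{*\prime}(\hat\beta) + Q_u^{*\prime\prime}(\hat\beta)(\hat\beta^* - \hat\beta)$,

which implies the following approximation of the estimation bias:

(6.1) $\hat\beta(x_0, \hat\alpha) - \hat\beta^*(x_0, \hat\alpha) \approx \{Q_u^{*\prime\prime}(\hat\beta)\}^{-1} Q_u^{*\prime}(\hat\beta)$.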

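Next we try to assess the variance of the estimate $\hat\beta$. To obtain the
variance, note that

$0 = Q_u'(\hat\beta) \approx Q_u'(\beta^{\circ}) + Q_u''(\beta^{\circ})(\hat\beta - \beta^{\circ})$,

where $\beta^{\circ} = (\beta_0^{\circ}, \ldots, \beta_p^{\circ})^T$ with
$\beta_j^{\circ} = \left( \frac{\eta_0(\cdot) - \eta(\cdot, \hat\alpha)}{\eta(\cdot, \hat\alpha)^{\gamma}} \right)^{(j)}(x_0) \, \eta(x_0, \hat\alpha)^{\gamma}/j! + \eta(x_0, \hat\alpha) 1_{\{j=0\}}$
for $j = 0, 1, \ldots, p$. This implies that

$\hat\beta - \beta^{\circ} \approx -Q_u''(\beta^{\circ})^{-1} Q_u'(\beta^{\circ})$,

and an approximation for the conditional variance is given by

$\mathrm{var}(\hat\beta \mid \mathbb{X}) \approx Q_u''(\beta^{\circ})^{-1} \mathrm{var}(Q_u'(\beta^{\circ}) \mid \mathbb{X}) Q_u''(\beta^{\circ})^{-1}$.

Here the Hessian matrix can be approximated by $Q_u''(\hat\beta)$, and the variance
term can be approximated as follows:

$\mathrm{var}(Q_u'(\beta^{\circ}) \mid \mathbb{X}) = \sum_{i=1}^n \mathrm{var}\left( \frac{\partial}{\partial \beta} Q(g^{-1}(\eta_i(\beta)), Y_i) \,\Big|\, X_i \right) \Big|_{\beta = \beta^{\circ}} K_h^2(X_i - x_0) \approx \sum_{i=1}^n c_i \mathbf{X}_i \mathbf{X}_i^T K_h^2(X_i - x_0) \left( \frac{\eta(X_i, \hat\alpha)}{\eta(x_0, \hat\alpha)} \right)^{2\gamma}$.

Note that $X_i$ has significant weight only in a neighborhood around $x_0$, and
for such $i$,

$c_i \approx [(g^{-1})'(\eta_0(x_0))]^2 / V(g^{-1}(\eta_0(x_0)))$.

Consequently, we have

(6.2) $\mathrm{var}(Q_u'(\beta^{\circ}) \mid \mathbb{X}) \approx \frac{[(g^{-1})'(\eta_0(x_0))]^2}{V(g^{-1}(\eta_0(x_0)))} S_n$,

where

$S_n = \sum_{i=1}^n \mathbf{X}_i \mathbf{X}_i^T K_h^2(X_i - x_0) \left( \frac{\eta(X_i, \hat\alpha)}{\eta(x_0, \hat\alpha)} \right)^{2\gamma}$.

Combining the above results, we get

$\mathrm{var}(\hat\beta \mid \mathbb{X}) \approx \frac{[(g^{-1})'(\eta_0(x_0))]^2}{V(g^{-1}(\eta_0(x_0)))} Q_u''(\beta^{\circ})^{-1} S_n Q_u''(\beta^{\circ})^{-1}$,

where the unknown $\eta_0(x_0)$ and $\beta^{\circ}$ can be replaced by their estimates
$\hat\eta_u(x_0)$ and $\hat\beta$, respectively.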



6.2. Bandwidth selection via bias-variance tradeoff. Based on the above
arguments, we first select a pilot bandwidth $h^*_{p+a+1,p+a}$, which can be
chosen using the ERSC. Next we fit a local polynomial with degree $p + a + 1$
and bandwidth $h^*_{p+a+1,p+a}$ to get an estimate
$\hat\beta^{(p+a)} = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_{p+a})^T$
via maximizing the quasi-log-likelihood function (3.4). Using $\hat\beta^{(p+a)}$, we
get the approximation error $r_i$ and hence the estimated bias $\hat B_{p,0}(x; h)$
and variance $\hat V_{p,0}(x; h)$ of $\hat\beta_0$, which are respectively the first
elements of the estimated bias vector (6.1) and variance matrix (6.2). An
estimator of the mean squared error (MSE) of $\hat\beta_0$ is given by

$\widehat{\mathrm{MSE}}_{p,0}(x_0; h) = \hat B_{p,0}^2(x_0; h) + \hat V_{p,0}(x_0; h)$,

which leads to our final bandwidth selector

(6.3) $\hat h_{p,0} = \arg\min_h \int \widehat{\mathrm{MSE}}_{p,0}(x; h) \, dx$.
h J 
7. Monte Carlo study. In this section, we use simulations to illustrate
the improvement of our newly proposed estimators by comparing them with
the original nonparametric method. For simulations in this section and real
data analysis in the next section, we use the canonical link and local linear
fitting by setting $p = 1$. To assess the bias term in the pre-asymptotic bandwidth
selection as discussed in Section 6, we choose the order of the approximation
to the Taylor expansion error to be $a = 2$. In simulation studies, we
first generate ten independent data sets, on which the pre-asymptotic bandwidth
selector based on a grid search is applied. We set our final selected
bandwidth to be the median of the obtained ten bandwidths, and it is fixed
and used in our simulation. This speeds up the computation considerably.
Different methods with their corresponding selected bandwidths are applied
to another $R = 1000$ independent data sets and results are reported. Whenever
a kernel is needed, the Epanechnikov kernel is used in all of our numerical
examples. For $\gamma$ in the unified family, we use a grid
$\Gamma = \{0, 0.1, 0.2, \ldots, 1, 1.2, 1.4, \ldots, 5\}$.

Example 7.1 (Poisson). Each observation pair (X, Y) in this example is 
generated in two steps: (1) the predictor variable X is marginally uniformly 
distributed over [—2,2]; (2) given X = x, the response Y is generated from 
Poisson distribution with mean exp(r/o(x)) where r]o(x) = 3sin(^x — ^) + 6. 
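Each sample consists of 100 i.i.d. pairs of observations. We estimate $\eta_0(\cdot)$
over $J = 100$ uniform grid points $\{x_j\}_{j=1}^J$ on the interval $[-2, 2]$. We use
three different parametric guides: $G_1 = \alpha_1 + \alpha_2 x + \alpha_3 x^2$,
$G_2 = \alpha_1 + \alpha_2 x + \alpha_3 x^2 + \alpha_4 x^3$ and
$G_3 = \alpha_1 + \alpha_2 \sin(\cdot)$, where $\sin(\cdot)$ is the sinusoid appearing
in $\eta_0(x)$.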

For an estimate $\hat\eta_r(\cdot)$ with $r$ indexing the replication of the simulation study, we define the bias $B_j = R^{-1}\sum_{r=1}^R [\hat\eta_r(x_j) - \eta_0(x_j)]$, the variance $S_j = R^{-1}\sum_{r=1}^R [\hat\eta_r(x_j) - R^{-1}\sum_{r'=1}^R \hat\eta_{r'}(x_j)]^2$ and the mean squared error $\mathrm{MSE}_j = B_j^2 + S_j$ at each $j$th grid point $x_j$. Let $B^2 = J^{-1}\sum_{j=1}^J B_j^2$, $V = J^{-1}\sum_{j=1}^J S_j$ and $\mathrm{MSE} = J^{-1}\sum_{j=1}^J \mathrm{MSE}_j$ be the averages of the squared bias, variance and mean squared error (MSE) of the estimate $\hat\eta(\cdot)$, respectively. In Table 1, we report, for different guides, the squared bias, variance and MSE of the original method, additive correction, multiplicative correction and the unified family of corrections with the best $\gamma$. The best $\gamma$ is the one that minimizes MSE over the grid $\Gamma$ and is given by 3.2, 3.0 and 3.0 for $G_1$, $G_2$ and $G_3$, respectively. The last block corresponds to $\gamma$ tuned by the selection method proposed in Section 5. The tuned $\gamma$ is given by 1.8, 1.8 and for $G_1$, $G_2$ and $G_3$, respectively. The top panel "best $h$" means that each method uses its corresponding best smoothing bandwidth, while the lower half "same $h$" corresponds to the case of using the same smoothing bandwidth selected by the original method.
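The summary measures defined above can be computed from an array of replicated estimates as in the following sketch. The function name `summarize_replications` and the array layout (one row per replication, one column per grid point) are assumptions made for illustration.

```python
import numpy as np

def summarize_replications(eta_hat, eta_true):
    """Average squared bias, variance and MSE over the grid.

    eta_hat  : (R, J) array, estimate from replication r at grid point x_j
    eta_true : (J,) array, true function eta_0 evaluated on the grid
    """
    eta_hat = np.asarray(eta_hat, dtype=float)
    eta_true = np.asarray(eta_true, dtype=float)
    bias_j = eta_hat.mean(axis=0) - eta_true   # B_j
    var_j = eta_hat.var(axis=0)                # S_j, dividing by R
    mse_j = bias_j**2 + var_j                  # MSE_j = B_j^2 + S_j
    return {
        "B2": float(np.mean(bias_j**2)),
        "V": float(np.mean(var_j)),
        "MSE": float(np.mean(mse_j)),
    }
```

With these definitions the reported MSE equals the plain average of squared errors over all replications and grid points, which gives a convenient sanity check.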

The lower panel indicates that a parametric guide reduces bias but has little effect on variance when the same $h$ is used for different methods. This is consistent with our asymptotic results. However, when the individual best $h$ is used for each method, a parametric guide also reduces variance, as shown in the top half of Table 1. The underlying reason is that an appropriate parametric guide makes the nonparametric correction term flatter and smoother, so that a larger $h$ is allowed. This in turn reduces the variance. Table 1 indicates that the tuned $\gamma$ does not perform as well as the best $\gamma$; however, it improves over the additive and multiplicative corrections for either the quadratic or cubic guide. Note further that the improvement from the additive or multiplicative correction to the best $\gamma$ is much smaller than the improvement from the original method to the additive or multiplicative correction. Based on this observation, we recommend using the simple method outlined at the end of Section 5 to select $\gamma$ for real applications.

We plot $B^2$, $V$ and MSE for our parametrically-guided nonparametric estimation with different values of $\gamma$ in Figure 3, with the isolated point at the far left corresponding to the original nonparametric estimation. Panels (A) and (C) correspond to the cubic guide, while panels (B) and (D) use the true sinusoid guide. The smoothing bandwidth is fixed at the same value for all estimation methods in panels (C) and (D), while each individual estimation uses its corresponding best smoothing parameter in panels (A) and (B). The figures give a picture of the results summarized in Table 1.

Fig. 3. Plots of $B^2$, $V$ and MSE, denoted by black x, blue + and red *, respectively, against $\gamma$. Panels: (A) cubic guide and best $h$; (B) sinusoid guide and best $h$; (C) cubic guide and same $h$; (D) sinusoid guide and same $h$.

For a random sample of size 100, the best fit within the quadratic family is shown by the dotted line in panel (A) of Figure 1. The true unknown $\eta_0(\cdot)$, the nonparametric estimate $\hat\eta(\cdot)$ and the two parametrically-guided estimates $\hat\eta_a(\cdot)$ and $\hat\eta_m(\cdot)$ are given by the solid, dotted, dashed and dot-dashed lines, respectively, in Figure 2. From this, we can see that the parametrically-guided estimates improve on the nonparametric counterpart around x = where the curvature of $\eta_0(\cdot)$ is large and makes nonparametric estimation difficult.

Example 7.2 (Bernoulli). In this example, we consider the Bernoulli distribution. The predictor variable $X$ is generated from Uniform$[-1, 1]$. Conditioning on $X = x$, the response $Y$ is generated from the Bernoulli distribution with success probability $\exp(\eta_0(x))/(1 + \exp(\eta_0(x)))$ where $\eta_0(x) = 2\sin(\pi x)$. In this case, we consider samples of size 500 for two reasons: (1) the estimation of the Bernoulli success probability is harder than in the Poisson case; (2) the use of a full sinusoid as the true $\eta_0(x)$ makes it even harder. The function $\eta_0(\cdot)$ is estimated over a uniform grid with $J = 100$ points over $[-1, 1]$. The averages of squared bias, variance and MSE are reported in Table 2 for three guides $G_1 = \alpha_1 + \alpha_2 x$, $G_2 = \alpha_1 + \alpha_2 x + \alpha_3 x^2 + \alpha_4 x^3$ and $G_3 = \alpha_1 + \alpha_2\sin(\pi x)$. The best $\gamma$ is 1, 0.7 and 0.6 for $G_1$, $G_2$ and $G_3$, respectively. The tuned $\gamma$ is given by 0.8, 0.6 and 0.7 for $G_1$, $G_2$ and $G_3$, respectively.
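The data-generating mechanism of this example can be sketched as follows. The function names, the seed and the choice to include both endpoints in the evaluation grid are assumptions for illustration; only the distributions, the sample size and the grid size come from the description above.

```python
import numpy as np

def eta0(x):
    """True log-odds function of the example: eta0(x) = 2 sin(pi x)."""
    return 2.0 * np.sin(np.pi * np.asarray(x, dtype=float))

def simulate_bernoulli(n=500, seed=0):
    """Draw n i.i.d. pairs with X ~ Uniform[-1, 1] and logit P(Y = 1 | X) = eta0(X)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    prob = 1.0 / (1.0 + np.exp(-eta0(x)))   # equals exp(eta) / (1 + exp(eta))
    y = rng.binomial(1, prob)
    return x, y

# Evaluation grid with J = 100 equally spaced points on [-1, 1] (endpoints included by assumption).
x_grid = np.linspace(-1.0, 1.0, 100)
```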

Note that no improvement is observed for the additive correction with a linear guide $\alpha_1 + \alpha_2 x$ in Example 7.2. We can resort to our theoretical results to understand this exception. As we use local linear fitting, the asymptotic bias depends theoretically on the second-order derivative of $\eta_0(\cdot) - \eta(\cdot, \alpha_0) + \eta(x_0, \alpha_0)$ and $\eta_0(\cdot)\eta(x_0, \alpha_0)/\eta(\cdot, \alpha_0)$ for the additive and multiplicative corrections, respectively. A linear guide cannot reduce the second-order derivative of $\eta_0(\cdot) - \eta(\cdot, \alpha_0) + \eta(x_0, \alpha_0)$ and consequently does not reduce bias. However, a linear guide slightly reduces the second-order derivative of $\eta_0(\cdot)\eta(x_0, \alpha_0)/\eta(\cdot, \alpha_0)$ and improves the corresponding performance. This is consistent with our numerical results in Table 2. Note further that the multiplicative correction performs the best among the unified family of corrections when the linear guide is used.
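To make the argument concrete, the two second derivatives can be written out for a linear guide. The shorthand $g(x) = \eta(x, \alpha_0) = \alpha_1 + \alpha_2 x$ and $c = g(x_0)$ is introduced here for illustration only, and the calculation assumes $g(x) \neq 0$ near $x_0$.

```latex
\begin{align*}
\frac{d^2}{dx^2}\bigl[\eta_0(x) - g(x) + c\bigr]
  &= \eta_0''(x) - g''(x) = \eta_0''(x), \\
\frac{d^2}{dx^2}\Bigl[\frac{c\,\eta_0(x)}{g(x)}\Bigr]
  &= c\Bigl[\frac{\eta_0''(x)}{g(x)} - \frac{2\alpha_2\,\eta_0'(x)}{g(x)^2}
      + \frac{2\alpha_2^2\,\eta_0(x)}{g(x)^3}\Bigr], \\
\text{at } x = x_0:\quad
  & \eta_0''(x_0) - \frac{2\alpha_2}{c}\,\eta_0'(x_0) + \frac{2\alpha_2^2}{c^2}\,\eta_0(x_0).
\end{align*}
```

The additive target leaves $\eta_0''$ untouched because $g'' = 0$, whereas the multiplicative target picks up two extra terms that can partly offset $\eta_0''(x_0)$, depending on the signs and sizes of $\alpha_2$, $\eta_0'(x_0)$ and $\eta_0(x_0)$.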

Note: All entries for squared bias, variance and MSE are multiplied by 100.

8. Real data analysis. In this section, we apply our newly proposed parametrically-guided nonparametric estimation schemes to the Financial Aid Award Data, provided by the National Longitudinal Survey of the High School Class of 1972. The data set is available online, and interested readers may find more information about this data set at http://www.oswego.edu/~kane/econometrics/finaid.htm. There are twenty variables. We are interested in using SAT score ($X$) to predict whether a student received financial aid grants. There are 3076 students in total with SAT scores between 600 and 1300. Out of these 3076 students, 916 students received some financial aid grants. The binary response $Y$ is coded in this way: $Y = 1$ means that a student received financial aid grants and $Y = 0$ otherwise.

The pre-asymptotic bandwidth selector gives a bandwidth of 258.4615 for the original nonparametric GLM. The corresponding nonparametric estimate of the log odds ratio $\log\{P(Y=1|X=x)/P(Y=0|X=x)\}$ is given by the dot-dashed line in Figure 4. Based on this, we choose a cubic guide for two reasons. First, from the simulation we know that the linear guide does not help at all in some cases. Second, the nonparametric estimate does not indicate a quadratic shape. Thus we apply the parametrically-guided logistic regression with a cubic guide. The pre-asymptotic bandwidth selector gives bandwidths 296.1538 and 296.1538 for the parametrically-guided additive and parametrically-guided multiplicative methods, respectively. The results are summarized in Figure 4. The cubic parametric estimate of the log odds ratio is given by the solid line; our parametrically-guided estimates are given by the dashed and dotted lines for the additive and multiplicative methods, respectively.
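The kind of fit described above can be illustrated with a minimal local-linear logistic sketch. It is not the authors' implementation: the function `local_linear_logit` is hypothetical, the bandwidth is fixed by hand rather than chosen by the pre-asymptotic selector, the data are synthetic rather than the Financial Aid Award Data, and the guide is a known function instead of a fitted cubic. The additive guide is handled as an offset $\eta(X_i, \alpha) - \eta(x_0, \alpha)$, assuming the additive correction corresponds to $\gamma = 0$ in the unified form given in the Appendix.

```python
import numpy as np

def epanechnikov(u):
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_linear_logit(x0, x, y, h, offset=None, max_iter=50, tol=1e-10):
    """Kernel-weighted logistic fit of beta0 + beta1 * (x - x0); returns beta0.

    With offset=None this is a plain local-linear estimate of the log odds at x0.
    Passing offset_i = guide(x_i) - guide(x0) gives an additively guided variant.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    off = np.zeros_like(x) if offset is None else np.asarray(offset, dtype=float)
    w = epanechnikov((x - x0) / h)
    keep = w > 0                                  # only points inside the window contribute
    design = np.column_stack([np.ones(keep.sum()), x[keep] - x0])
    y, w, off = y[keep], w[keep], off[keep]
    beta = np.zeros(2)
    for _ in range(max_iter):                     # Newton-Raphson on the local log-likelihood
        p = 1.0 / (1.0 + np.exp(-(off + design @ beta)))
        score = design.T @ (w * (y - p))
        info = (design * (w * p * (1.0 - p))[:, None]).T @ design
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta[0]

# Illustration on synthetic data with true log odds 2 sin(pi x).
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=5000)
guide = lambda t: 2.0 * np.sin(np.pi * t)         # hypothetical guide, equal to the truth here
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-guide(x))))
x0, h = 0.5, 0.2
plain = local_linear_logit(x0, x, y, h)
guided = local_linear_logit(x0, x, y, h, offset=guide(x) - guide(x0))
```

Both `plain` and `guided` estimate the log odds at `x0`, whose true value is 2 in this synthetic setting. When the guide matches the truth, the offset removes the curvature that the local-linear fit would otherwise have to absorb, which is the mechanism behind the bias reduction discussed in the simulations.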

We observe that our parametrically-guided additive and multiplicative estimates follow the cubic fit very closely. This suggests that there is no model specification error in using a cubic model. However, the nonparametric estimate differs from the cubic fit for lower SAT scores.

Fig. 4. Plots of the nonparametric estimate, cubic estimate and our parametrically-guided estimates of the log odds ratio function for the Financial Aid Award Data. The horizontal axis is the SAT score, from 600 to 1300.

9. Discussion. In this work, we extend the methodology of parametrically-guided nonparametric estimation to GLMs and QLM. Asymptotic properties and numerical evidence demonstrate its improvement over the original nonparametric estimation scheme. There are possible extensions. For example, the whole estimation scheme can be easily extended to multivariate varying-coefficient and additive models. This enables us to incorporate prior knowledge into the analysis of multivariate nonparametric models, ameliorating the curse of dimensionality.

APPENDIX: CONDITIONS AND PROOFS 

Let $q_i(x, y) = (\partial^i/\partial x^i)Q(g^{-1}(x), y)$. Note that $q_i$ is linear in $y$ for fixed $x$ and that $q_1(\eta_0(x_0), \mu(x_0)) = 0$ and $q_2(\eta_0(x_0), \mu(x_0)) = -\rho(x_0)$.
The following technical conditions are imposed:

(A1) The function $q_2(x, y) < 0$ for $x \in \mathbb{R}$ and $y$ in the range of the response variable.

(A2) The functions $f_X'$, $\eta_0^{(p+2)}$, $\frac{\partial^{p+2}}{\partial x^{p+2}}\eta(x, \alpha)$, $\mathrm{var}(Y|X = \cdot)$, $V''$ and $g'''$ are continuous.

(A3) For each $x \in \mathrm{supp}(f_X)$, $\rho(x)$, $\mathrm{var}(Y|X = x)$ and $g'(\mu(x))$ are nonzero.
(A4) The kernel $K$ is a symmetric probability density with support $[-1, 1]$.
(A5) For each point $x_\partial$ on the boundary of $\mathrm{supp}(f_X)$, there exists an interval $\mathcal{C}$ containing $x_\partial$ having nonnull interior such that $\inf_{x \in \mathcal{C}} f_X(x) > 0$.

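White (1982)-type conditions:

(B1) $E\log(f_0(x, y))$ exists and there exists an $m_1(x, y)$ such that $|\log(f(x, y; \alpha))| \le m_1(x, y)$ for any $\alpha \in A$ and $Em_1(x, y) < \infty$.
(B2) $E(\log(f_0(x, y)/f(x, y; \alpha)))$ has a unique minimizer $\alpha_0$.
(B3) $\log f(x, y; \alpha)$ is continuously differentiable in $\alpha_j$ for $j = 1, 2, \ldots, q$.

(B4) There exist $m_2(x, y)$ and $m_3(x, y)$ such that $|\frac{\partial}{\partial\alpha_i}\log f(x, y; \alpha)\cdot\frac{\partial}{\partial\alpha_j}\log f(x, y; \alpha)| \le m_2(x, y)$ and $|\frac{\partial^2}{\partial\alpha_i\partial\alpha_j}\log f(x, y; \alpha)| \le m_3(x, y)$ for any $\alpha \in A$, $1 \le i, j \le q$. Furthermore, both $Em_2(X, Y)$ and $Em_3(X, Y)$ exist.

(B5) Assume that $\alpha_0$ is an interior point of $A$; the matrix $(E\frac{\partial}{\partial\alpha_i}\log f(x, y; \alpha)\cdot\frac{\partial}{\partial\alpha_j}\log f(x, y; \alpha))_{1 \le i, j \le q}$ is nonsingular at $\alpha_0$; $\alpha_0$ is a regular point of the matrix $(E\frac{\partial^2}{\partial\alpha_i\partial\alpha_j}\log f(x, y; \alpha))_{1 \le i, j \le q}$.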

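For the case of unified correction with a fixed guide $\eta(x, \alpha)$, denote $\hat\beta = \hat\beta(x_0, \alpha) = \arg\max_\beta Q_u(\beta; h, x_0, \alpha)$. Because $\hat\beta$ is calculated using $X_i$ near $x_0$, we expect that

$$\eta(X_i, \alpha) + \{\hat\beta_0 + \cdots + \hat\beta_p(X_i - x_0)^p - \eta(x_0, \alpha)\}\,\eta(X_i, \alpha)^\gamma/\eta(x_0, \alpha)^\gamma \approx \eta_0(x_0) + \eta_0'(x_0)(X_i - x_0) + \cdots + \eta_0^{(p)}(x_0)(X_i - x_0)^p/p!.$$

Consequently, we expect that $\hat\beta_0 \to \eta_0(x_0)$ and

$$\hat\beta_j \to \Bigl(\frac{\eta_0 - \eta(\cdot, \alpha)}{\eta(\cdot, \alpha)^\gamma}\Bigr)^{(j)}(x_0)\,\eta(x_0, \alpha)^\gamma/j! \quad\text{for } 1 \le j \le p.$$

We define $\phi_{\alpha,\gamma}(x) = (\eta_0(x) - \eta(x, \alpha))/\eta(x, \alpha)^\gamma$ to simplify our notation. We thus study the asymptotic properties of

$$\hat\beta^* = (nh)^{1/2}\bigl(\hat\beta_0 - \eta_0(x_0),\; h\{\hat\beta_1 - \eta(x_0, \alpha)^\gamma\phi^{(1)}_{\alpha,\gamma}(x_0)\},\; \ldots,\; h^p\{p!\hat\beta_p - \eta(x_0, \alpha)^\gamma\phi^{(p)}_{\alpha,\gamma}(x_0)\}\bigr)^T$$

so that each component has the same rate of convergence. Let $Q_p(A)$ and $T_p(A)$ be the $(p+1)\times(p+1)$ matrices having $(i, j)$th entry equal to $\nu_{i+j-1}(A)$ and $\int_A z^{i+j-2}K(z)^2\,dz$, respectively. Also, define $D = \mathrm{diag}(1, 1/1!, \ldots, 1/p!)$, $\Sigma_x(A) = \rho(x)f_X(x)DN_p(A)D$,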



fx(x) var(y|X = x) „ 



7] z ^{x, a) 



and 



^ 1) (x )(pr ? 27 (-,a)/x) / (x ) 
(p + l)!(p77T(.,a)/ x )(x ) 

x ^+ 2 Kj- llP {z; A) dz - (j - 1) J z p+x Kj^ p {z; A) dz 
-- [ z p+1 K j ^ liP (z;A)dz [ z p+1 K P!P (z;A)dz 



Let $b_{x_0}(A)$ be the $(p+1)\times 1$ vector having $j$th entry equal to $\sqrt{nh^{2p+3}}\,a_{1,j}(A) + \sqrt{nh^{2p+5}}\,a_{2,j}(A)$.

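Main theorem 1. Suppose that conditions (A1)-(A5) hold and that $h = h_n \to 0$, $nh^{2p+1} \to \infty$, $nh^{2p+5} < \infty$ as $n \to \infty$. If $x_0$ is an interior point of $\mathrm{supp}(f_X)$, and $p > 0$, then

$$\{\Sigma_{x_0}([-1,1])^{-1}\Gamma_{x_0}([-1,1])\Sigma_{x_0}([-1,1])^{-1}\}^{-1/2} \times \{\hat\beta^* - b_{x_0}([-1,1]) + o(\sqrt{nh^{2p+5}})\} \to_D N(0, I_{p+1}).$$

If $x_0 = x_n$ is of the form $x_0 = x_\partial + hc$ where $c \in [-1, 1]$ is fixed and $x_\partial$ is a fixed point on the boundary of $\mathrm{supp}(f_X)$, then

$$\{\Sigma_{x_0}(\mathcal{D}_{x_0,h})^{-1}\Gamma_{x_0}(\mathcal{D}_{x_0,h})\Sigma_{x_0}(\mathcal{D}_{x_0,h})^{-1}\}^{-1/2} \times \{\hat\beta^* - b_{x_0}(\mathcal{D}_{x_0,h}) + o(\sqrt{nh^{2p+5}})\} \to_D N(0, I_{p+1}).$$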
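As a quick check that the bandwidth conditions of the theorem are compatible, consider the choice $h_n = c\,n^{-1/(2p+3)}$ with a constant $c > 0$. Both the constant $c$ and this particular rate are illustrative assumptions, not something prescribed by the theorem.

```latex
\begin{align*}
h_n &= c\,n^{-1/(2p+3)} \to 0, \\
n h_n^{2p+1} &= c^{2p+1}\, n^{1 - (2p+1)/(2p+3)} = c^{2p+1}\, n^{2/(2p+3)} \to \infty, \\
n h_n^{2p+5} &= c^{2p+5}\, n^{1 - (2p+5)/(2p+3)} = c^{2p+5}\, n^{-2/(2p+3)} \to 0 < \infty .
\end{align*}
```

For local linear fitting ($p = 1$) this is the rate $h_n \propto n^{-1/5}$, under which $nh_n^{2p+3}$ stays bounded.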

The proof of the main theorem follows directly from Lemmas 1 and 2, which are stated and proved as follows. Denote $Z_i = (1, (X_i - x_0)/h, \ldots, (X_i - x_0)^p/(h^p p!))^T$.

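Lemma 1. Let $\bar\eta(x_0, x) = \eta(x_0, \alpha)^\gamma\sum_{j=0}^p \phi^{(j)}_{\alpha,\gamma}(x_0)(x - x_0)^j/j!$ and $W_n = (nh)^{-1/2}\sum_{i=1}^n Y_i^*$ where

$$Y_i^* = \Bigl(\frac{\eta(X_i, \alpha)}{\eta(x_0, \alpha)}\Bigr)^\gamma q_1\Bigl(\eta(X_i, \alpha) + \Bigl(\frac{\eta(X_i, \alpha)}{\eta(x_0, \alpha)}\Bigr)^\gamma\bar\eta(x_0, X_i),\, Y_i\Bigr)K\{(X_i - x_0)/h\}Z_i.$$

Then under conditions (A1)-(A5), $nh^3 \to \infty$ and $h \to 0$, we have

$$\hat\beta^* = \Sigma_{x_0}^{-1}W_n - h\Sigma_{x_0}^{-1}\Lambda_{x_0}\Sigma_{x_0}^{-1}W_n + o_P(h).$$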

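Proof. Recall that $\hat\beta$ maximizes $Q_u(\beta; x_0, p, \alpha)$. Let

$$\beta^* = (nh)^{1/2}\bigl(\beta_0 - \eta_0(x_0),\; h\{\beta_1 - \eta(x_0, \alpha)^\gamma\phi^{(1)}_{\alpha,\gamma}(x_0)\},\; \ldots,\; h^p\{p!\beta_p - \eta(x_0, \alpha)^\gamma\phi^{(p)}_{\alpha,\gamma}(x_0)\}\bigr)^T.$$

Then

$$\eta(X_i, \alpha) + \frac{\eta(X_i, \alpha)^\gamma}{\eta(x_0, \alpha)^\gamma}\{\beta_0 + \beta_1(X_i - x_0) + \cdots + \beta_p(X_i - x_0)^p - \eta(x_0, \alpha)\} = \eta(X_i, \alpha) + \Bigl(\frac{\eta(X_i, \alpha)}{\eta(x_0, \alpha)}\Bigr)^\gamma\{\bar\eta(x_0, X_i) + a_n\beta^{*T}Z_i\},$$

where $a_n = (nh)^{-1/2}$. If $\hat\beta$ maximizes $Q_u(\beta; x_0, p, \alpha)$, then $\hat\beta^*$ maximizes

$$\sum_{i=1}^n Q\Bigl(g^{-1}\Bigl(\eta(X_i, \alpha) + \Bigl(\frac{\eta(X_i, \alpha)}{\eta(x_0, \alpha)}\Bigr)^\gamma\{\bar\eta(x_0, X_i) + a_n\beta^{*T}Z_i\}\Bigr),\, Y_i\Bigr)K\Bigl(\frac{X_i - x_0}{h}\Bigr)$$

as a function of $\beta^*$. To study the asymptotic properties of $\hat\beta^*$, we apply the quadratic approximation lemma [Fan and Gijbels (1995)] to the maximization of the normalized function

$$l_n(\beta^*) = \sum_{i=1}^n\Bigl\{Q\Bigl(g^{-1}\Bigl(\eta(X_i, \alpha) + \Bigl(\frac{\eta(X_i, \alpha)}{\eta(x_0, \alpha)}\Bigr)^\gamma\{\bar\eta(x_0, X_i) + a_n\beta^{*T}Z_i\}\Bigr),\, Y_i\Bigr) - Q\Bigl(g^{-1}\Bigl(\eta(X_i, \alpha) + \Bigl(\frac{\eta(X_i, \alpha)}{\eta(x_0, \alpha)}\Bigr)^\gamma\bar\eta(x_0, X_i)\Bigr),\, Y_i\Bigr)\Bigr\}K\Bigl(\frac{X_i - x_0}{h}\Bigr).$$

Then $\hat\beta^*$ maximizes $l_n$. We remark that condition (A1) implies that $l_n$ is concave in $\beta^*$. Using a Taylor series expansion of $Q(g^{-1}(\cdot), Y_i)$,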



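$$\begin{aligned}
l_n(\beta^*) ={}& a_n\sum_{i=1}^n\Bigl(\frac{\eta(X_i, \alpha)}{\eta(x_0, \alpha)}\Bigr)^{\gamma}q_1\Bigl(\eta(X_i, \alpha) + \Bigl(\frac{\eta(X_i, \alpha)}{\eta(x_0, \alpha)}\Bigr)^{\gamma}\bar\eta(x_0, X_i),\, Y_i\Bigr)\beta^{*T}Z_iK\{(X_i - x_0)/h\} \\
&+ \frac{a_n^2}{2}\sum_{i=1}^n\Bigl(\frac{\eta(X_i, \alpha)}{\eta(x_0, \alpha)}\Bigr)^{2\gamma}q_2\Bigl(\eta(X_i, \alpha) + \Bigl(\frac{\eta(X_i, \alpha)}{\eta(x_0, \alpha)}\Bigr)^{\gamma}\bar\eta(x_0, X_i),\, Y_i\Bigr)(\beta^{*T}Z_i)^2K\{(X_i - x_0)/h\} \\
&+ \frac{a_n^3}{6}\sum_{i=1}^n\Bigl(\frac{\eta(X_i, \alpha)}{\eta(x_0, \alpha)}\Bigr)^{3\gamma}q_3\Bigl(\eta(X_i, \alpha) + \Bigl(\frac{\eta(X_i, \alpha)}{\eta(x_0, \alpha)}\Bigr)^{\gamma}\eta_i,\, Y_i\Bigr)(\beta^{*T}Z_i)^3K\{(X_i - x_0)/h\},
\end{aligned}\tag{A.1}$$

where $\eta_i$ is between $\bar\eta(x_0, X_i)$ and $\bar\eta(x_0, X_i) + a_n\beta^{*T}Z_i$. Let

$$A_n = a_n^2\sum_{i=1}^n\Bigl(\frac{\eta(X_i, \alpha)}{\eta(x_0, \alpha)}\Bigr)^{2\gamma}q_2\Bigl(\eta(X_i, \alpha) + \Bigl(\frac{\eta(X_i, \alpha)}{\eta(x_0, \alpha)}\Bigr)^{\gamma}\bar\eta(x_0, X_i),\, Y_i\Bigr)K\{(X_i - x_0)/h\}Z_iZ_i^T.$$

Then the second term in (A.1) equals $\frac{1}{2}\beta^{*T}A_n\beta^*$. Now $(A_n)_{ij} = (EA_n)_{ij} + O_P(\sqrt{\mathrm{var}((A_n)_{ij})})$ and

$$EA_n = h^{-1}E\Bigl\{\Bigl(\frac{\eta(X_1, \alpha)}{\eta(x_0, \alpha)}\Bigr)^{2\gamma}q_2\Bigl(\eta(X_1, \alpha) + \Bigl(\frac{\eta(X_1, \alpha)}{\eta(x_0, \alpha)}\Bigr)^{\gamma}\bar\eta(x_0, X_1),\, \mu(X_1)\Bigr)K\{(X_1 - x_0)/h\}Z_1Z_1^T\Bigr\}$$

since $q_2$ is linear in $y$ for fixed $x$. Because $\mathrm{supp}(K) = [-1, 1]$, we need only consider $|X_1 - x_0| \le h$, and thus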

n {Xy , a) + ( ^ Q . } ) , Ai ) - % (X x ) 

= A X , ay | -— L-^ 1 ) (x ) (A a - * ) p+1 
Then

(i-iy.ij-iy.iEA^j 



hT l E 



xq 2 [ri(X 1 ,a)+ ^ / ) ^o, ^1), M^F 



x A" 

/r/(xo + /i2 , ,Q:)\ 27 



X 1 -x o\/A- 1 -x \^- 2 



V »7(aJO)«) 
x g 2 + ^Z, 

+ ( ^ X ° + hZ > a } \ fj(x ,x + hZ),iJ,(xo + hZ) 



V ??(^o,«) 
x K(Z)Z l+j - 2 f x (x + fcZ) dZ 

27 



^faofco + fcZ) + o(hP),Kx + hZ)) 



x K(Z)Z i+j - 2 f x (x + /iZ) c2Z 
^(xq + ZiZ,^^ 



rj(x ,a) 

x A(Z)Z i+i_2 /x(a;o + hZ) dZ 
i](x + hZ,a^\ 2 ^ 



Mvoixo + hZ),n(x + hZ)) + o{W)\ 



p(x + hZ)f x (x + hZ) 

r]{x ,a) J 

x K{Z)Z i+j ~ 2 dZ + o{h) 

it \t \ (p?? 27 (-,q)/x) / (xo) 
= -(pfx){xo)Vi +j -2 - h ~2^u^aj + °( h >- 

Similar arguments show that $\mathrm{var}\{(A_n)_{ij}\} = O\{(nh)^{-1}\}$ and that the last term in (A.1) is $O_P\{(nh)^{-1/2}\}$. Therefore, $l_n(\beta^*) = W_n^T\beta^* - \frac{1}{2}\beta^{*T}(\Sigma_{x_0} + h\Lambda_{x_0})\beta^* + o_P(h)$ because $nh^3 \to \infty$ and $h \to 0$. Similar arguments show that $\nabla l_n(\beta^*) = W_n - (\Sigma_{x_0} + h\Lambda_{x_0})\beta^* + o_P(h)$ and $\nabla^2 l_n(\beta^*) = -(\Sigma_{x_0} + h\Lambda_{x_0}) + o_P(h)$. The result follows directly from the quadratic approximation lemma of Fan and Gijbels (1995). □
Lemma 2. Suppose that the conditions of Theorem 1 hold. For $W_n$ as defined in Lemma 1,

$$\{\Sigma_{x_0}^{-1} - h\Sigma_{x_0}^{-1}\Lambda_{x_0}\Sigma_{x_0}^{-1}\}E(W_n) = b_{x_0} + o\{(nh^{2p+5})^{1/2}\},$$

$$\Gamma_{x_0}^{-1/2}\,\mathrm{cov}(W_n)\,\Gamma_{x_0}^{-1/2} \to I_{p+1}$$

and

$$\Gamma_{x_0}^{-1/2}(W_n - EW_n) \to_D N(0, I_{p+1}).$$

Proof. We compute the mean and covariance matrix of the random 
vector W n by studying Y*, as defined in Lemma 1. Denote (E~Y*)i to be the 
mean of the ith component of Y*. Then it is easy to show that ~I (£Y*)j 
is equal to 



V rj(x ,cx) 
x Z i ~ 1 K(Z)f x {x + hZ)dZ 
Now by the Taylor expansion, 



qi (v(x + hZ, a) + i]( x o,xo + hZ),fi(x + hZ) 

+ ^M(/,zr 2 + o(h^)]p(x + hZ) 

+ o(h p+2 ). 

Thus 

{EYlh ~ V (x , a p^-l)\ v (p + 1)! P+4+ C ^ Xo) ^ +1 
(A.2) 

+ o(^+ 3 ), 



where 



^ 0j -(p + 2)!^ iX0j+ (p + l)!(p^(,a) /x )( S0 ) • 
Note that

/ n \V2 1 



n\ 1 ' 
h) 



D -1 N -1 hP+ 2_^^+l) {xo)rt{x0tCxr 



x (Vp+1, Vp+2, ■ ■ ■ , V2p+l) T 

+ h p+3 ( p (x )ri(x , a) 7 (f p+2 , v p +3, ■ • ■ , ^2 P +2) T 
The ith component of Y}~^EW n is 

(^EW n )i = (J) (t - l)!({iV- 1 Ki, {iV- 1 }^, • • • , {iV- 1 }^!) 

1, l/p+2, • • • , ^2p+l) 

+ /t P+3 Cp(^0)??(2;0, Q) 7 (^p+2, Vp+3, V2p+2) T 
= ( ^2p + 3 ) l/2_^^p+l) (xo)r?(x0)Q)7 | z P+l K ._ lp{z)dz 



+ (nh 2 P +5 ) 1 / 2 ( p (x )7 1 (x , a y J Z P+ 2 K^ ltP (z)dz 

+ o{{nh 2 P +b ) l l 2 }. 
Next, consider the second term in the expression 

x ^(N^Q^- 1 ),^ + 0{(n/^+ 7 ) 1/2 }- 

Using the fact that (Q p )m = (N p )k,l+i for I <p+ 1, it can be shown that for 
i = 2,...,p+l, 

(N^QpN" 1 ^- = CN- 1 )*-^ + j^N; 1 ),,^ j(N;Vi,. 
and by similar reasoning, 

(N^QpNj 1 )^ = |E(N- 1 ) 1|fcI / p+fc |(N; 1 ) p+ .i J . 
So by Lemma 3 of Fan, Heckman and Wand (1995),

P+i 

(i-l)!^(N; 1 Q p N; 1 ) i ^ p+J 



+ -J Z v +1 K PjP (z)dz J zt> +1 K t _ ltP (z)dz. 

The statement concerning the asymptotic mean follows immediately. By 

(A. 2), the covariance between the ith and jth component of Y* is £?((Y*)j(Y^)j) + 

0(h 2p+4 ). By a Taylor series expansion, E((Y*)i(Y*) j) is given by 



E 



\r]{xo,a.) 



{ (x l -x )/hy+i- 2 

x 



xfj(x ,x + hZ),Y 1 )K(Z)^j 



Z i+ i- 2 fx(x + hZ)h 
(i-l)I(j-l)! 



| feC^xo + fcz), yi)K(z)) 2 ^ J n^rw dz + 



Noticing that 

Olivia + hZ),Yi) 

Yi- g^ir/ixo + hZ)) _ ly 
^ rfa-'fafo + tZ))) (g ><'<*» + **»• 

we can derive 

r / v * u /i/x(^o)var(y|X = x ) f z l+J ~ 2 2 . 

{cov Y i )}ij= rT/ / 7 vwTf / j- — TW- — rw (Z)dZ + o(h). 

Therefore, $\Gamma_{x_0}^{-1/2}\,\mathrm{cov}(W_n)\,\Gamma_{x_0}^{-1/2} \to I_{p+1}$. Now, we use the Cramér-Wold device to derive the asymptotic normality of $W_n$. For any unit vector $u \in \mathbb{R}^{p+1}$, if

(A.3) $(na_n^2)^{-1/2}u^T\mathrm{cov}(Y_1^*)^{-1/2}(W_n - EW_n) \to_D N(0, 1)$

then $h^{1/2}\mathrm{cov}(Y_1^*)^{-1/2}(W_n - EW_n) \to_D N(0, I_{p+1})$, and so $\Gamma_{x_0}^{-1/2}(W_n - EW_n) \to_D N(0, I_{p+1})$. To prove (A.3), we only need to check Lyapounov's condition for that sequence, which can be easily verified. □

Noting that fa Vo (x ) and $j v(xo,*V( \£)? ) U Hxo)/?- f ° r 1 < 
j < P, we can define the estimator of t]q\xq) iteratively as f)uo( x o]Pi h, a ) = 
fj u (x ;p,h,a.) = /3 and 



Vu,j(xo;p,h) =jl(3j - rj(x , a) 7 ^2v Uti (x ;p, h)(l/rf r )^ l \x ,a)^ 

+ i](x ,(xy(ri 1 -' y )^(xo,<x) for l<j<p, 
where (?) =j\/{i\{j — £)!). Simple algebra leads to 
fju,j(xo;p,h) - t^\xq) 

j-l 

- r)(x , a=) 7 ^2(r)u,i( x o;P, h ) ~ W ( x o)) 

i=0 

X(W) 0-) (X0)Q) Q for l<j<p. 

Denote to 3 - = fj u j(x ;p,h) - rj^\x ), Vj = jlfy - rj(x , «) 7 (^r^7^) (j) 
(a;o)-T/(xo,a)l{ 3=0 } and 



(l/r ? 7 )^-*)(xo,a)7 7 (xo,a) 7 ) 



where $1_{\{j=0\}} = 1$ when $j = 0$ and $0$ otherwise.

Let $L$ be a $(p+1)\times(p+1)$ matrix. For $0 \le i, j \le p$, its $(i+1, j+1)$ element, denoted by $L_{i,j}$, is defined as follows. Set $L_{i,j} = 1$ when $i = j$, $L_{i,j} = 0$ when $i < j$ and

i-j-l 

i=l j<k 1 <k 2 <---<k l <i 

when i > j. Then iOj = u,- - Ei=o ^j'^i = Ei=o L j,j-i v j-i = Ei=o 
With the above notation, we have



(A.4) 



L x 



110! -r/(x ,Q) 7 



p\$ p - rj(x ,cxf 



0!/?o ~Vo(xo) 

rj - rj(-,a)\ (1) 



7?(-,a)7 



(^o) 



(xo) 



The above equation allows us to study the asymptotic bias and variance 
of fj Uj j(xo;p,h) — ?]q\x) using those of 0o — i]o(xo) and i\0i — r)(xo, a) 7 
(^ffj^fHxo) where l<i<p. 

Proposition 1. Let $p - j \ge 0$ and suppose that conditions (A1)-(A5) stated in the Appendix are satisfied. Assume that $h = h_n \to 0$, $nh^{2p+1} \to \infty$, and $nh^{2p+5} < \infty$ as $n \to \infty$. If $x_0$ is a fixed point in the interior of $\mathrm{supp}(f_X)$ satisfying $\eta(x_0, \alpha) \ne 0$, then


Vnh 2 i~ 



(A.5) 



a j:j , p (x ;K) 



T]o-ri(;a) 



U) 



(xo) 



J7(.,a)f 

T](x ,a)l {j=0} - Bias(j)^j % N(0, 1), 

where the bias term $\mathrm{Bias}(j)$ is given by $\mathrm{Bias}_o(j)$ when $p - j$ is odd and $\mathrm{Bias}_e(j)$ when $p - j$ is even, with definitions

(p+T)T { r,(; a )i ) (xq)v(xq> ex.) (j z^ K hp (z) to 



Bias Q {j) 



x{l + 0(h)} 



and 

Bias e (j) 



r?(x ,Q!) 7 frjo- v(-, a )\ {p+2) 



zP+2K ^ z)dz \p + 2y. v v(,«P 

+ (J z p+2 K lp (z)dz-j J z^Kj.^dz 



(xo) 

1 

(p+1) 



V0-V(;<*) \ (P+1) ( ^ ) (pr ? 27 (-,Q)/ x )^(x ) | ^ p _ i+2 



7/(-,a) 



(prff(;a)fx)(x ) J 
$\times\,\{1 + O(h)\}$.

Based on (A. 4), we get the asymptotic distribution result for our estimates 
fju,j(xQ;p, h, a) as follows: 

(A.6) —(rj U!j (x ;p, h) - rjg\x Q ) - Bias(j)) ^ N{0, 1). 

a j,jA x o> K ) 

U xq = x n is of the form xq = x$ + ch satisfying rj(xo,a) ^ where x$ 
is a point on the boundary of supp(/x) and cS [—1,1], then (A.6) holds 
with a^ sp (xo;K) and J z p+l K r>p (z) dz replaced by a 2 . sp (xo;K,V XQ: h) and 

Proof of Proposition 1. The result (A.6) in Proposition 1 follows from the main theorem by reading off the marginal distributions of the components of $\hat\beta^*$. To calculate the asymptotic variance, we calculate the $(r+1, s+1)$ entry of $r!s!N_p(A)^{-1}T_p(A)N_p(A)^{-1}$ as

$$\frac{r!s!}{|N_p(A)|^2}\sum_{k=1}^{p+1}\sum_{l=1}^{p+1}c_{r+1,k}c_{s+1,l}\{T_p(A)\}_{kl} = \int_A K_{r,p}(z; A)K_{s,p}(z; A)\,dz,$$

where $c_{ij}$ is the cofactor of $\{N_p(A)\}_{ij}$. The equation comes from the following argument:

$$\begin{aligned}
\sum_{k=1}^{p+1}\sum_{l=1}^{p+1}c_{r+1,k}c_{s+1,l}\{T_p(A)\}_{kl}
&= \int_A\sum_{k=1}^{p+1}\sum_{l=1}^{p+1}c_{r+1,k}c_{s+1,l}z^{k+l-2}K(z)^2\,dz \\
&= \int_A\Bigl(\sum_{k=1}^{p+1}c_{r+1,k}z^{k-1}K(z)\Bigr)\Bigl(\sum_{l=1}^{p+1}c_{s+1,l}z^{l-1}K(z)\Bigr)dz \\
&= \int_A|M_{r,p}(z; A)||M_{s,p}(z; A)|K(z)^2\,dz.
\end{aligned}$$
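The cofactor identity above can be checked numerically. The sketch below assumes $A = [-1, 1]$, the Epanechnikov kernel, and the equivalent-kernel definition $K_{r,p}(z; A) = r!\{|M_{r,p}(z; A)|/|N_p(A)|\}K(z)$ of Fan, Heckman and Wand (1995), where $M_{r,p}$ is $N_p$ with its $(r+1)$st column replaced by $(1, z, \ldots, z^p)^T$; that definition is not restated in this Appendix, so it is an assumption here. The function name `check_variance_identity` is hypothetical.

```python
import numpy as np
from math import factorial

def epanechnikov(z):
    return np.where(np.abs(z) <= 1.0, 0.75 * (1.0 - z**2), 0.0)

def trapezoid(values, grid):
    """Composite trapezoid rule, written out to avoid depending on a NumPy version."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))

def check_variance_identity(p, r, s, n_grid=4001):
    """Return both sides of the identity for 0-based orders r, s <= p."""
    z = np.linspace(-1.0, 1.0, n_grid)
    k = epanechnikov(z)
    powers = np.vstack([z**m for m in range(p + 1)])   # rows 1, z, ..., z^p
    # N_p has entries int z^(i+j) K(z) dz (0-based i, j); T_p uses K(z)^2 instead.
    N = np.array([[trapezoid(z**(i + j) * k, z) for j in range(p + 1)]
                  for i in range(p + 1)])
    T = np.array([[trapezoid(z**(i + j) * k**2, z) for j in range(p + 1)]
                  for i in range(p + 1)])
    N_inv = np.linalg.inv(N)
    lhs = factorial(r) * factorial(s) * (N_inv @ T @ N_inv)[r, s]

    def equivalent_kernel(m):
        # m! |M_{m,p}(z)| / |N_p| K(z), with column m of N_p replaced by (1, z, ..., z^p)^T.
        dets = np.empty_like(z)
        for idx in range(z.size):
            M = N.copy()
            M[:, m] = powers[:, idx]
            dets[idx] = np.linalg.det(M)
        return factorial(m) * dets / np.linalg.det(N) * k

    rhs = trapezoid(equivalent_kernel(r) * equivalent_kernel(s), z)
    return lhs, rhs
```

Because both sides use the same quadrature rule, they should agree up to floating-point error rather than only up to discretization error; for `p=1, r=0, s=0` both are close to $\int K(z)^2\,dz = 0.6$ for the Epanechnikov kernel.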

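The asymptotic results in (A.6) for $\hat\eta_{u,j}(x_0; p, h, \alpha)$ are easily proved by noting that its bias and variance are dominated by those of the single term $j!\hat\beta_j - \eta(x_0, \alpha)^\gamma\bigl(\frac{\eta_0 - \eta(\cdot, \alpha)}{\eta(\cdot, \alpha)^\gamma}\bigr)^{(j)}(x_0) - \eta(x_0, \alpha)1_{\{j=0\}}$ based on (A.4), since $L$ is a lower triangular matrix. □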

Proof of Theorem 1. The result of Theorem 1 is the special case of 
Proposition 1 for j = 0. □ 

Proof of Theorem 2. Note first that under conditions (B1)-(B5), we have $\|\hat\alpha - \alpha_0\| = n^{-1/2}O_P(1)$ by Theorem 3.2 of White (1982). This implies
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that ~(Q u (/3; h, xq, olq) — Q u (/3; h, xq, a)) — ► 0. Note that condition (Al) im- 
plies that both Q u (/3;h,xo,oto) and Q u (/3;h,xo,a)) are strictly concave in 

(3. Consequently, ||/3(xo, a ) — (3(xq, a)\\ — > as we have obtained the asymp- 
totic result of 0(xQ,ao). 

Note that ±(Q u (f3;h,xo,ao) -Q u (f3;h,x ,a)) = O p (l/^/n), ^\\-^Qu((3;h, 

xo,ol ) - ^Q u ((3;h,x ,a)\\ = O p (l/^/n), ^\\g^rQu(P;h,x ,a ) - 

j^^Qu{P]h,XQ,OL)\\F = O p {l/ y/n) for every f3 where || • \\f denotes the 
matrix's Frobenius norm defined as the square root of the sum of squares of 
each element. With consistency established above, we can consider a local 
compact set. By the standard argument of the Taylor expansion used for proving asymptotic normality, we get $\|\hat\beta(x_0, \alpha_0) - \hat\beta(x_0, \hat\alpha)\| = n^{-1/2}O_P(1)$, which is faster than the convergence rate in our Theorem 1. Hence using the estimated $\hat\alpha$ does not affect our asymptotic convergence rates, as desired.
□